Employee Promotion Prediction

HR analytics is the process of collecting and analysing human resources (HR) data in order to improve an organization's workforce. It can help answer critical questions about the organization, which enables better, data-driven decision-making. Managing promotions effectively is one of the most powerful ways leaders can drive their company’s success. The promotion process allows leaders to evaluate each employee and their potential to be promoted. Estimating the probability of promotion from employee features requires some analysis. Therefore, this project performs a predictive analysis to identify the employees most likely to be promoted.

The dataset was obtained from Kaggle and contains 54,808 observations. The variables available in this dataset are:

As the information above shows, the data contains 13 columns: 8 are numerical variables and the others are categorical. Missing values were also observed in the education and previous_year_rating columns.

Exploratory Data Analysis

The percentage of missing values in both the education and previous_year_rating columns was less than 10%, so we drop the rows that contain them; this should have little effect on the dataset.
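A minimal sketch of this check, assuming the data sits in a pandas DataFrame. The small frame below is made-up illustration data, not the Kaggle file, and dropping rows (rather than columns) is an assumption about how the deletion was done.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's", "Below Secondary"],
    "previous_year_rating": [3.0, 5.0, np.nan, 4.0, 1.0],
    "age": [30, 41, 28, 35, 24],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Drop every row that has a missing value in either of the two columns.
df_clean = df.dropna(subset=["education", "previous_year_rating"])
```

Comparing `len(df)` with `len(df_clean)` shows how many observations the deletion costs.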

The table above shows summary statistics for the numerical features in the dataset. The median age is 34, with a mean of 35. The minimum number of trainings is 1, whereas the maximum is 10. The mean previous year rating is 3, with a minimum rating of 1 and a maximum of 5. Furthermore, the mean and the median of length of service are 6 and 5 respectively. Finally, the avg training score has a mean of 63, with minimum and maximum scores of 39 and 99 respectively.

The plot above shows that around 29.2% of employees were in the Sales & Marketing department, followed by 21.7% in the Operations department. In contrast, departments such as R&D and Legal had much lower percentages, at 1.84% and 1.78% respectively.

The figure above shows that only around 8% of employees were promoted.

The plot above shows that the majority of employees (79.6%) had a previous year rating of 3 or above.

The majority of employees had between 2 and 7 years of experience, while only a small minority had over 20 years.

Among employees who received one training in the last year, 9% were promoted, compared with 7.7% of employees who received two trainings. The distribution also indicates that few employees received more than 4 trainings.

Some insights from data

Q1-What is the probability of getting promoted if an employee has won an award?

Therefore, the probability that an employee who won an award gets promoted is 44.8%.
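This conditional probability is simply the promotion rate within the group of award winners. A minimal sketch with pandas follows; the rows are made up, and the column names `awards_won?` and `is_promoted` are assumed to match the Kaggle file.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
df = pd.DataFrame({
    "awards_won?": [1, 1, 0, 0, 0, 1, 0, 0],
    "is_promoted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# P(promoted | award) = promoted award winners / all award winners.
winners = df[df["awards_won?"] == 1]
p_promoted_given_award = winners["is_promoted"].mean()

# The same rate for every group at once.
rate_by_award = df.groupby("awards_won?")["is_promoted"].mean()
```

Because `is_promoted` is coded as 0/1, its mean within a group is the share of promoted employees in that group. Grouping by another categorical column gives the promotion rate per category in the same way.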

Q2-How do promotion rates differ by level of education?

Among employees with a Master's degree or above, 9.89% were promoted.

Among employees with a Bachelor's degree, 8.18% were promoted.

Among employees with a Below Secondary education, 7.86% were promoted.

Q3-Is there any statistically significant difference in the average previous year rating between promoted and non-promoted employees?

The t-test indicates that there is a statistically significant difference in the average previous year rating between promoted and non-promoted employees.
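A sketch of how such a two-sample comparison can be run with SciPy. The source does not say which variant of the t-test was used, so Welch's version (`equal_var=False`) and the 0.05 threshold are assumptions, and the ratings below are made up.

```python
from scipy import stats

# Made-up ratings for illustration only.
promoted_ratings = [5, 4, 5, 4, 5, 3, 5, 4]
not_promoted_ratings = [3, 2, 4, 3, 1, 3, 2, 4]

# Welch's t-test does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(
    promoted_ratings, not_promoted_ratings, equal_var=False
)

# Assumed significance level of 5%.
significant = p_value < 0.05
```

A p-value below the chosen level suggests the difference in group means is unlikely to be due to chance alone.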

The pairplot helps to explore whether any features are good predictors that can separate our classes. Examining the plots on the diagonal, we can see that the class distributions overlap heavily for most features.

Formal Data Analysis

Data pre-processing

Log Transformation: Applying a log transformation to the skewed variables (no_of_trainings and length_of_service) reduces their skew, which can help improve model performance in the modeling part.
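A minimal sketch of the transformation on made-up values. The natural log is used here because both columns start at 1; if a column could contain zeros, `np.log1p` would be a safer choice.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values for illustration.
df = pd.DataFrame({
    "no_of_trainings": [1, 1, 1, 2, 2, 3, 10],
    "length_of_service": [1, 2, 3, 4, 5, 7, 34],
})

skew_before = df.skew()

# Replace each skewed column with its natural log.
df_log = df.copy()
for col in ["no_of_trainings", "length_of_service"]:
    df_log[col] = np.log(df_log[col])

skew_after = df_log.skew()
```

Comparing `skew_before` with `skew_after` shows how much the long right tail has been compressed.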

Synthetic Minority Oversampling: SMOTE (Synthetic Minority Oversampling Technique) will be used to handle the issue of imbalanced classes.
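To illustrate the idea behind SMOTE, here is a simplified NumPy sketch: each synthetic row is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is not the implementation used in the project (which is not shown); in practice a library class such as `SMOTE` from imbalanced-learn would normally be used. The function name `smote_sketch` and the data are hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                   # random minority sample
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # position on the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Made-up minority-class points for illustration.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
X_new = smote_sketch(X_min, n_new=6)
```

The synthetic rows are then appended to the minority class until the two classes are roughly balanced.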

Training and testing Data
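The split itself is not shown, so the following is only a sketch of a typical approach with scikit-learn. The 80/20 ratio, the stratification and the random seed are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix and binary target for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 50 + [1] * 50)

# 80/20 split is an assumption; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```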

Feature Scaling
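The scaler used in the project is not shown. As a sketch, standardization with scikit-learn's `StandardScaler` is assumed here, fitted on the training data only; the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features for illustration.
X_train = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0], [50.0, 7.0]])
X_test = np.array([[35.0, 4.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse them on the test data so no test information leaks into training.
X_test_scaled = scaler.transform(X_test)
```

Scaling matters for models such as logistic regression, while tree-based models such as random forest are largely insensitive to it.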

Models

The logistic regression model shows a good fit to the data, with 86% accuracy, 87% precision, and 86% for both recall and F1-score. The confusion matrix shows that the model predicted 2541 employees as not recommended for promotion when they actually were (false negatives). On the other hand, it predicted 1127 employees as recommended for promotion when they actually were not (false positives).

The random forest model shows a great fit to the data, with 95% for each of recall, precision, F1-score, and accuracy. The confusion matrix shows that the model predicted 842 employees as not recommended for promotion when they actually were (false negatives). On the other hand, it predicted 393 employees as recommended for promotion when they actually were not (false positives).

The main objective of this project is to identify the employees who are most likely to be promoted. The recall of the random forest model indicates that the classifier correctly identified 95% of the employees who were actually promoted. Moreover, the random forest model produced far fewer false negatives than the logistic regression model (842 versus 2541). Therefore, the random forest model is chosen as the predictive model for this project.
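The link between false negatives and recall can be made explicit by computing the metrics directly from confusion-matrix counts. The helper `classification_metrics` and the counts passed to it are hypothetical; they are not the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted promotions, how many were real
    recall = tp / (tp + fn)     # of real promotions, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not the project's results.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
```

Since recall is TP / (TP + FN), every false negative removed raises recall, which is why it is the natural metric when missing a deserving employee is the costlier error.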

Feature importance for random forest

The feature importance figure for the random forest shows that previous year rating is the most important feature for classifying our target. The other important features are avg training score, length of service, age, and no of trainings.
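A sketch of how such a ranking can be obtained from a fitted scikit-learn random forest. The data is synthetic, the feature names are hypothetical, and the hyperparameters are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
```

Impurity-based importances can favour features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.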

Conclusion

We have found a good model, the random forest, with performance appropriate for predicting whether an employee will be recommended for promotion.

Thank you.